Frequency Distributions: Tables and Types
Frequency and Frequency Distribution
Frequency
In statistics, the frequency of a particular data value (or observation) is the number of times that value appears in the dataset. It indicates how often a specific observation occurs within the collected data.
Frequency can be determined for individual values in both quantitative and qualitative datasets.
Example 1. The marks obtained by 10 students in a short quiz are: 8, 5, 7, 8, 6, 9, 7, 8, 6, 8. Find the frequency of the mark '8'.
Answer:
Let's count how many times the value '8' appears in the list:
8, 5, 7, 8, 6, 9, 7, 8, 6, 8
The value '8' appears 4 times.
Therefore, the frequency of the mark '8' is 4.
Example 2. A small survey recorded the mode of transport used by 12 employees to get to work: Car, Train, Car, Bus, Train, Car, Car, Bus, Train, Car, Bus, Car. Find the frequency of 'Car'.
Answer:
Let's count how many times 'Car' appears in the list:
Car, Train, Car, Bus, Train, Car, Car, Bus, Train, Car, Bus, Car
The category 'Car' appears 6 times.
Therefore, the frequency of using 'Car' as the mode of transport is 6.
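The two counts above can be checked with a short sketch. It uses Python's built-in `list.count`; the variable names are illustrative and not part of the original examples.

```python
# Data from Example 1 (marks) and Example 2 (mode of transport).
marks = [8, 5, 7, 8, 6, 9, 7, 8, 6, 8]
transport = ["Car", "Train", "Car", "Bus", "Train", "Car",
             "Car", "Bus", "Train", "Car", "Bus", "Car"]

# list.count returns the number of times a value occurs in the list.
freq_of_8 = marks.count(8)
freq_of_car = transport.count("Car")

print(freq_of_8)    # 4
print(freq_of_car)  # 6
```

The same call works for quantitative and qualitative data, because frequency only depends on how often a value is repeated.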
Frequency Distribution
A frequency distribution is a tabular or graphical representation that displays the frequencies of various data values or categories in a dataset. It organizes raw data into a structured format by showing how often each distinct value or group of values occurs.
A frequency distribution summarizes the pattern of variation in the data: where the values concentrate, the range they span, and the overall shape of the distribution.
A frequency distribution can be presented in different forms:
- Frequency Distribution Table: A table listing values/classes and their frequencies.
- Graphical Representation: Using charts and graphs like Bar Graphs, Histograms, Frequency Polygons, etc., to visually depict the frequency distribution.
The main purpose is to condense large amounts of raw data into a more easily understandable and analyzable format.
Example 1. Based on the marks data from Example 1 under 'Frequency' (8, 5, 7, 8, 6, 9, 7, 8, 6, 8), create a frequency distribution.
Answer:
We list each distinct mark and its frequency:
Mark | Frequency |
---|---|
5 | 1 |
6 | 2 |
7 | 2 |
8 | 4 |
9 | 1 |
Total | 10 |
This table shows the frequency distribution of the marks obtained by the 10 students.
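As a rough sketch, the same table can be produced with `collections.Counter` from Python's standard library. The variable name `distribution` is an assumption made for illustration.

```python
from collections import Counter

marks = [8, 5, 7, 8, 6, 9, 7, 8, 6, 8]

# Counter maps each distinct value to the number of times it occurs.
distribution = Counter(marks)

print("Mark | Frequency")
for mark in sorted(distribution):          # distinct marks in ascending order
    print(f"{mark} | {distribution[mark]}")
print(f"Total | {sum(distribution.values())}")
```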
Ungrouped Frequency Distribution Table
An ungrouped frequency distribution table (also known as a simple frequency distribution table) is a method of organizing raw data by listing every distinct value that appears in the dataset and recording the number of times each value occurs. This table is suitable when the variable is discrete and has a small number of unique values, or when dealing with a small dataset where listing each value's frequency is manageable.
This table directly shows the frequency of each individual data point present in the dataset.
Construction of an Ungrouped Frequency Distribution Table
To construct an ungrouped frequency distribution table from a set of raw data, follow these steps:
- Identify Distinct Values: Examine the raw dataset and list all the unique values that appear in it. Arrange these values in ascending order (from the smallest to the largest) in the first column of your table.
- Tally the Frequencies (Optional but Recommended): Create a second column for tally marks. Go through the raw data observation by observation. For each observation, make a tally mark (|) next to the corresponding distinct value in the table. To make counting easier, group tally marks in bundles of five; the fifth mark is a diagonal line crossing the previous four ($\bcancel{||||}$).
- Count Frequencies: Count the total number of tally marks for each distinct value. This count represents the frequency ($f$) of that value. Record these frequencies in a third column labelled "Frequency".
- Calculate Total Frequency: Sum up all the frequencies in the "Frequency" column. This total sum ($\sum f$) should be equal to the total number of observations in the original raw dataset ($\text{N}$). This step serves as a check to ensure all observations have been counted correctly.
Example 1. The number of goals scored by a football team in 20 matches are: 2, 3, 0, 1, 4, 2, 1, 0, 3, 2, 4, 1, 1, 0, 2, 2, 3, 1, 2, 0. Construct an ungrouped frequency distribution table.
Answer:
The distinct values (number of goals scored) are 0, 1, 2, 3, and 4. We list these in ascending order.
Now, we tally the frequency of each score from the raw data:
- 0 goals: Appears 4 times (0, 0, 0, 0) - $||||$
- 1 goal: Appears 5 times (1, 1, 1, 1, 1) - $\bcancel{||||}$
- 2 goals: Appears 6 times (2, 2, 2, 2, 2, 2) - $\bcancel{||||}\space |$
- 3 goals: Appears 3 times (3, 3, 3) - $|||$
- 4 goals: Appears 2 times (4, 4) - $||$
Now, we put this into a table:
Goals Scored (x) | Tally Marks | Frequency (f) (No. of Matches) |
---|---|---|
0 | $||||$ | 4 |
1 | $\bcancel{||||}$ | 5 |
2 | $\bcancel{||||}\space |$ | 6 |
3 | $|||$ | 3 |
4 | $||$ | 2 |
Total | | 20 |
The sum of frequencies is $4 + 5 + 6 + 3 + 2 = 20$, which matches the total number of matches given. The table is correctly constructed.
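The four construction steps can be sketched in code using the goals data from this example. This is a minimal illustration, not a standard routine: the helper names `tally` and `ungrouped_table` are hypothetical, and a bundle of five is rendered in plain text as `||||/`.

```python
from collections import Counter

def tally(count):
    """Render a count as tally marks, writing each bundle of five as '||||/'."""
    bundles, rest = divmod(count, 5)
    return " ".join(["||||/"] * bundles + (["|" * rest] if rest else []))

def ungrouped_table(data):
    """Return (value, tally, frequency) rows in ascending order of value."""
    counts = Counter(data)
    rows = [(value, tally(counts[value]), counts[value])
            for value in sorted(counts)]
    # Check: the frequencies must add up to the number of observations.
    assert sum(f for _, _, f in rows) == len(data)
    return rows

goals = [2, 3, 0, 1, 4, 2, 1, 0, 3, 2, 4, 1, 1, 0, 2, 2, 3, 1, 2, 0]
for value, marks, freq in ungrouped_table(goals):
    print(value, marks, freq)
```

The final check inside `ungrouped_table` mirrors the last step of the procedure: the total frequency must equal the number of observations.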
Example 2. A survey asked 15 students how many pets they own. The responses were: 1, 0, 2, 1, 1, 3, 0, 2, 1, 0, 1, 2, 2, 1, 3. Construct an ungrouped frequency distribution table for the number of pets owned by students.
Answer:
The distinct values (number of pets) are 0, 1, 2, and 3. We list these in ascending order.
Tallying the frequencies:
- 0 pets: Appears 3 times (0, 0, 0) - $|||$
- 1 pet: Appears 6 times (1, 1, 1, 1, 1, 1) - $\bcancel{||||}\space |$
- 2 pets: Appears 4 times (2, 2, 2, 2) - $||||$
- 3 pets: Appears 2 times (3, 3) - $||$
Number of Pets (x) | Tally Marks | Frequency (f) (No. of Students) |
---|---|---|
0 | $|||$ | 3 |
1 | $\bcancel{||||}\space |$ | 6 |
2 | $||||$ | 4 |
3 | $||$ | 2 |
Total | | 15 |
The sum of frequencies is $3 + 6 + 4 + 2 = 15$, which matches the total number of students given. The table is correctly constructed.
Grouped Frequency Distribution Table (Class Intervals, Limits, Class Size)
When dealing with a large number of observations or when the data is continuous (meaning it can take any value within a given range), listing each individual value and its frequency becomes impractical and doesn't help in summarizing the data effectively. In such cases, the data is organized into groups or classes. A grouped frequency distribution table is then constructed, which shows the frequency of observations falling within each predefined group or class.
Need for Grouping Data
Organizing data into groups is essential for several reasons:
- Summarization: It condenses large datasets into a manageable form.
- Pattern Recognition: It helps reveal the distribution pattern of the data, such as clustering or spread.
- Simplification: It makes the data easier to interpret and understand at a glance.
- Analysis: It forms the basis for calculating various statistical measures (like mean, median, mode, standard deviation) for grouped data and for constructing graphical representations.
Key Components of a Grouped Frequency Distribution
Understanding the terminology used in grouped frequency distributions is crucial:
- Classes / Class Intervals:
These are the mutually exclusive ranges or groups into which the raw data is divided. Each observation belongs to exactly one class interval. Examples include ranges like $0-10$, $10-20$, $20-30$ or $0-9$, $10-19$, $20-29$.
- Class Limits:
These are the smallest and largest values that can theoretically belong to a particular class interval.
- Lower Class Limit (LCL): The smallest value included in a class interval.
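- Upper Class Limit (UCL): The value that closes a class interval. Under the inclusive method it is the largest value included in the class; under the exclusive method the upper limit itself is counted in the next class.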
For example, in the class interval $10-20$, $10$ is typically the lower class limit and $20$ is the upper class limit.
- Class Boundaries:
These are the real or actual dividing points between consecutive class intervals. Class boundaries are used to ensure continuity in data representation, especially for continuous data or when converting discrete intervals to a continuous scale for graphical or computational purposes. They bridge the gap that might exist between inclusive class intervals.
- For Exclusive Intervals (e.g., $10-20$, $20-30$), the upper limit of a class is the same as the lower limit of the next class. In this case, the class limits effectively serve as the class boundaries. The lower boundary of $10-20$ is $10$ and the upper boundary is $20$.
- For Inclusive Intervals (e.g., $10-19$, $20-29$), there is a gap between the upper limit of one class (19) and the lower limit of the next (20). To find the class boundaries, we calculate an adjustment factor:
$\text{Adjustment Factor} = \frac{\text{Lower limit of next class} - \text{Upper limit of current class}}{2}$
For the classes $10-19$ and $20-29$, the adjustment factor is $\frac{20 - 19}{2} = \frac{1}{2} = 0.5$.
The class boundaries are then calculated as:
Lower Class Boundary = Lower Class Limit - Adjustment Factor
Upper Class Boundary = Upper Class Limit + Adjustment Factor
So, for the inclusive class $10-19$, the boundaries are $10 - 0.5 = 9.5$ (Lower Boundary) and $19 + 0.5 = 19.5$ (Upper Boundary).
Class boundaries are crucial for constructing Histograms and Ogives, and for calculating the Median and Mode of grouped data.
- Class Size / Width ($h$ or $i$ or $c$):
This is the difference between the upper class boundary and the lower class boundary of a class interval. If using exclusive limits, it is the difference between the upper and lower limits.
- For exclusive intervals (e.g., $10-20$), Class Size $= \text{Upper Limit} - \text{Lower Limit} = 20 - 10 = 10$.
- For inclusive intervals (e.g., $10-19$, with boundaries $9.5-19.5$), Class Size $= \text{Upper Boundary} - \text{Lower Boundary} = 19.5 - 9.5 = 10$.
Ideally, the class size should be constant for all class intervals in a frequency distribution to maintain consistency and facilitate comparison.
- Class Mark / Midpoint ($x_i$ or $x_m$):
This is the representative value for each class interval. It is calculated as the average of the lower and upper class limits (or boundaries).
$\text{Class Mark} = \frac{\text{Lower Limit + Upper Limit}}{2}$
$\text{Class Mark} = \frac{\text{Lower Boundary + Upper Boundary}}{2}$
For the class $10-20$, the class mark is $\frac{10+20}{2} = 15$. For the class $10-19$, the class mark is $\frac{10+19}{2} = 14.5$. Class marks are used in calculations like finding the arithmetic mean for grouped data.
- Frequency ($f$):
This is the number of observations from the original data set that fall into a specific class interval.
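The boundary, class size, and class mark calculations above can be sketched together. The function name `class_boundaries` is hypothetical, and the sketch assumes at least two classes with the same gap between every pair of consecutive classes.

```python
def class_boundaries(classes):
    """Convert (lower limit, upper limit) pairs to class boundaries.

    Assumes at least two classes and a constant gap between consecutive classes.
    """
    # Adjustment factor: half the gap between one class and the next.
    adjustment = (classes[1][0] - classes[0][1]) / 2
    return [(lower - adjustment, upper + adjustment) for lower, upper in classes]

inclusive = [(10, 19), (20, 29)]
for (lower, upper), (lb, ub) in zip(inclusive, class_boundaries(inclusive)):
    size = ub - lb               # class size, taken from the boundaries
    mark = (lower + upper) / 2   # class mark, taken from the limits
    print(f"{lower}-{upper}: boundaries {lb}-{ub}, size {size}, mark {mark}")
```

For exclusive classes such as $10-20$ and $20-30$ the gap is zero, so the function returns the limits unchanged, which matches the statement that exclusive limits already serve as boundaries.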
Types of Class Intervals
Class intervals can be defined using two primary methods:
- Exclusive Method (or Continuous Method):
In this method, the class intervals are defined such that the upper limit of one class is the lower limit of the next class (e.g., $0-10$, $10-20$, $20-30$). The upper limit of a class is excluded from that class and is included in the next class. For instance, an observation with value $10$ would be included in the $10-20$ class, not the $0-10$ class. This method is particularly suitable for continuous data where values can fall anywhere within a range.
- Inclusive Method (or Discrete Method):
In this method, the class intervals are defined such that both the lower and upper limits are included within the same class (e.g., $0-9$, $10-19$, $20-29$). There is a distinct gap between the upper limit of one class and the lower limit of the next. This method is generally used for discrete data where values are usually integers. When using this method for graphical representations like histograms or for calculating the median/mode of grouped data, it is necessary to convert the limits into class boundaries to ensure continuity.
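The difference between the two methods comes down to how the upper limit is compared. The following is a small sketch; `find_class` and its `method` parameter are names invented for illustration.

```python
def find_class(value, classes, method):
    """Return the (lower, upper) class that contains `value`, or None.

    method="exclusive": lower <= value < upper   (upper limit excluded)
    method="inclusive": lower <= value <= upper  (both limits included)
    """
    for lower, upper in classes:
        if method == "exclusive" and lower <= value < upper:
            return (lower, upper)
        if method == "inclusive" and lower <= value <= upper:
            return (lower, upper)
    return None

exclusive = [(0, 10), (10, 20), (20, 30)]
inclusive = [(0, 9), (10, 19), (20, 29)]

print(find_class(10, exclusive, "exclusive"))   # (10, 20)
print(find_class(10, inclusive, "inclusive"))   # (10, 19)
print(find_class(9.5, inclusive, "inclusive"))  # None: 9.5 falls in the gap
```

The last call shows why inclusive intervals suit discrete data: a non-integer value such as 9.5 has no class until the limits are converted to boundaries.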
Construction of a Grouped Frequency Distribution Table
Follow these steps to construct a grouped frequency distribution table from raw data:
- Determine the Range of the Data: Calculate the difference between the highest and lowest values in the raw data set.
Range = Maximum Value - Minimum Value
- Decide the Number of Classes ($k$): Choose an appropriate number of class intervals. While there's no strict rule, typically $5$ to $15$ classes are used. Too few classes oversimplify the data; too many classes defeat the purpose of grouping. Sturges' rule provides a guideline:
$\text{Number of Classes} (k) \approx 1 + 3.322 \log_{10} N$ ... (1)
where $N$ is the total number of observations. The result should be rounded up to the next whole number.
- Determine the Class Size / Width ($h$): Calculate the approximate width of each class interval by dividing the range by the chosen number of classes.
$\text{Approximate Class Size} (h) \approx \frac{\text{Range}}{\text{Number of Classes}}$
Round this value up to a convenient number (e.g., 5, 10, 20, 100) to make the class limits easy to work with. It is best to keep the class size constant for all intervals.
- Set up Class Limits or Boundaries: Choose a starting value for the first class. This value should be equal to or slightly less than the minimum value in the data. Then, determine the upper limit/boundary using the chosen class size and the chosen method (exclusive or inclusive). Continue setting up subsequent class intervals until the maximum value of the data is included in the last class.
- Tally the Observations: Go through each observation in the raw data. For each observation, place a tally mark ($|$) in the row of the class interval to which it belongs. Remember the rule for exclusive intervals (upper limit excluded) and inclusive intervals (both limits included). Use blocks of five tally marks for easier counting ($\bcancel{||||}$).
- Count the Frequency: Count the number of tally marks in each row to get the frequency ($f$) for each class interval.
- Sum the Frequencies: Add up the frequencies of all class intervals. This sum should be equal to the total number of observations ($N$). If it's not, recheck the tallying and counting process.
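The first four steps can be sketched as follows. The figures used here (30 observations, minimum 41, maximum 70, starting value 40) are illustrative assumptions, and the helper names are hypothetical.

```python
import math

def sturges_classes(n):
    """Number of classes suggested by Sturges' rule, rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

def round_up_to(value, step):
    """Round `value` up to the next multiple of `step`."""
    return math.ceil(value / step) * step

def exclusive_classes(start, maximum, width):
    """Exclusive classes [lower, upper) from `start` until `maximum` is covered."""
    classes = []
    lower = start
    while lower <= maximum:
        classes.append((lower, lower + width))
        lower += width
    return classes

n, minimum, maximum = 30, 41, 70         # assumed summary of a raw dataset
data_range = maximum - minimum           # step 1: range
k = sturges_classes(n)                   # step 2: number of classes
width = round_up_to(data_range / k, 5)   # step 3: width, rounded up to a multiple of 5
classes = exclusive_classes(40, maximum, width)  # step 4: class limits

print(data_range, k, width)
print(classes)
```

With these figures Sturges' rule suggests 6 classes, but the sketch produces 7: the first class starts below the minimum, and the maximum sits exactly on an excluded upper limit, so one more class is needed. The rule is a guideline for choosing the width, not a guarantee of the final count.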
Example
Example 1. The weights (in kg) of 30 students are given below. Construct a grouped frequency distribution table using exclusive class intervals of size 5, starting from 40 kg.
45 | 52 | 61 | 48 | 55 | 63 | 70 | 51 | 60 | 68 |
58 | 42 | 49 | 57 | 65 | 62 | 59 | 46 | 53 | 66 |
41 | 56 | 64 | 50 | 69 | 54 | 47 | 58 | 61 | 56 |
Answer:
Given: Raw data of weights (in kg) of 30 students.
To Construct: Grouped frequency distribution table with exclusive classes of size 5, starting from 40 kg.
Solution:
The minimum value in the data is $41$ kg and the maximum value is $70$ kg.
Range = Maximum Value - Minimum Value = $70 - 41 = 29$ kg.
The desired class size is $h=5$.
The starting point for the first class is $40$. Since exclusive intervals are required (where the upper limit is excluded from the class), the class intervals will be:
- $40 - 45$ (includes values $\geq 40$ and $< 45$)
- $45 - 50$ (includes values $\geq 45$ and $< 50$)
- $50 - 55$ (includes values $\geq 50$ and $< 55$)
- $55 - 60$ (includes values $\geq 55$ and $< 60$)
- $60 - 65$ (includes values $\geq 60$ and $< 65$)
- $65 - 70$ (includes values $\geq 65$ and $< 70$)
- $70 - 75$ (includes values $\geq 70$ and $< 75$)
We continue creating intervals until the maximum value (70) is included. Because the upper limit is excluded, 70 does not belong to $65 - 70$, so a seventh interval, $70 - 75$, is needed.
Now, we tally the given 30 observations by placing a tally mark ($|$) in the appropriate class interval. We then count the tally marks to find the frequency ($f$) for each class.
Weight (kg) (Exclusive Class Intervals) | Tally Marks | Frequency (f) (No. of Students) |
---|---|---|
40 - 45 | $||$ | 2 |
45 - 50 | $\bcancel{||||}$ | 5 |
50 - 55 | $\bcancel{||||}$ | 5 |
55 - 60 | $\bcancel{||||} \ ||$ | 7 |
60 - 65 | $\bcancel{||||} \ |$ | 6 |
65 - 70 | $||||$ | 4 |
70 - 75 | $|$ | 1 |
Total | | 30 |
The sum of frequencies is $2+5+5+7+6+4+1 = 30$, which equals the total number of students, confirming the tallying is correct.
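The tallying in this example can be reproduced with a short sketch. It assumes integer data that is not smaller than the starting value; the function name `grouped_frequencies` is hypothetical.

```python
weights = [45, 52, 61, 48, 55, 63, 70, 51, 60, 68,
           58, 42, 49, 57, 65, 62, 59, 46, 53, 66,
           41, 56, 64, 50, 69, 54, 47, 58, 61, 56]

def grouped_frequencies(data, start, width):
    """Count observations in exclusive classes [lower, lower + width)."""
    n_classes = (max(data) - start) // width + 1
    counts = [0] * n_classes
    for x in data:
        # Integer division sends a value equal to an upper limit to the next class.
        counts[(x - start) // width] += 1
    return [((start + i * width, start + (i + 1) * width), f)
            for i, f in enumerate(counts)]

table = grouped_frequencies(weights, start=40, width=5)
for (lower, upper), f in table:
    print(f"{lower} - {upper} | {f}")
print("Total |", sum(f for _, f in table))
```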
Cumulative Frequency and Cumulative Frequency Distribution Table
In addition to knowing how many observations fall within each class interval (frequency), it is often useful to know the total number of observations that fall below or above a certain value. This is where the concept of cumulative frequency comes into play.
Cumulative frequency refers to the running total of frequencies. It helps in understanding the distribution of data by showing how many observations fall below a particular value, or how many are greater than or equal to a particular value.
Types of Cumulative Frequency
There are two main types of cumulative frequency, depending on whether we are accumulating frequencies from the lower end or the upper end of the distribution:
- Less Than Cumulative Frequency (cf):
The 'less than' cumulative frequency for a class interval is the sum of the frequencies of that class and all classes preceding it. It indicates the total count of observations whose values are less than the upper boundary of the respective class interval.
To calculate the 'less than' cumulative frequency:
- Start with the frequency of the first class as the 'less than' cumulative frequency for that class (or for the upper boundary of that class).
- For each subsequent class, add its frequency to the cumulative frequency of the previous class.
- The 'less than' cumulative frequency for the last class (or its upper boundary) will be equal to the total number of observations ($N$).
('Less Than' CF of current class) = ('Less Than' CF of previous class) + (Frequency of current class)
- More Than Cumulative Frequency:
The 'more than' cumulative frequency for a class interval is the sum of the frequencies of that class and all classes succeeding it. It indicates the total count of observations whose values are greater than or equal to the lower boundary of the respective class interval.
To calculate the 'more than' cumulative frequency:
- Start with the total number of observations ($N$) as the 'more than' cumulative frequency for the lower boundary of the first class (since all observations are greater than or equal to the first lower boundary).
- For each subsequent class (using its lower boundary), subtract the frequency of the *previous* class from the cumulative frequency calculated for the previous class's lower boundary.
- The 'more than' cumulative frequency for the lower boundary of the last class will be equal to the frequency of the last class.
('More Than' CF for current lower boundary) = ('More Than' CF for previous lower boundary) - (Frequency of previous class)
Alternatively, sum frequencies starting from the bottom-most class upwards.
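Both running totals can be sketched with `itertools.accumulate` from Python's standard library. The frequencies are those of the weights example above; the variable names are illustrative.

```python
from itertools import accumulate

frequencies = [2, 5, 5, 7, 6, 4, 1]   # class frequencies, first class to last

# 'Less than' cf: running total from the first class downwards.
less_than_cf = list(accumulate(frequencies))

# 'More than' cf: running total from the last class upwards,
# then reversed back into first-to-last order.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

print(less_than_cf)  # [2, 7, 12, 19, 25, 29, 30]
print(more_than_cf)  # [30, 28, 23, 18, 11, 5, 1]
```

The last 'less than' value and the first 'more than' value both equal $N = 30$, which is a quick check on the arithmetic.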
Cumulative Frequency Distribution Table
A cumulative frequency distribution table presents the cumulative frequencies against the corresponding class boundaries or limits.
1. Less Than Cumulative Frequency Distribution Table
This table typically lists the upper boundaries of the class intervals and their corresponding 'less than' cumulative frequencies.
Using the weight data from Example 1 of the grouped frequency distribution section, with frequencies 2, 5, 5, 7, 6, 4, 1 and Total N=30:
Weight (kg) (Less than Upper Boundary) | Frequency (f) | Less Than Cumulative Frequency (cf) |
---|---|---|
Less than 45 | 2 | 2 |
Less than 50 | 5 | $2 + 5 = 7$ |
Less than 55 | 5 | $7 + 5 = 12$ |
Less than 60 | 7 | $12 + 7 = 19$ |
Less than 65 | 6 | $19 + 6 = 25$ |
Less than 70 | 4 | $25 + 4 = 29$ |
Less than 75 | 1 | $29 + 1 = 30$ |
Total | 30 | |
Interpretation: From this table, we can see, for example, that 12 students have weights less than 55 kg, and 30 students have weights less than 75 kg (which is the total number of students).
2. More Than Cumulative Frequency Distribution Table
This table typically lists the lower boundaries of the class intervals and their corresponding 'more than' cumulative frequencies.
Using the weight data from Example 1 with frequencies 2, 5, 5, 7, 6, 4, 1 and Total N=30:
Weight (kg) (More than or equal to Lower Boundary) | Frequency (f) | More Than Cumulative Frequency |
---|---|---|
More than or equal to 40 | 2 | 30 |
More than or equal to 45 | 5 | $30 - 2 = 28$ |
More than or equal to 50 | 5 | $28 - 5 = 23$ |
More than or equal to 55 | 7 | $23 - 5 = 18$ |
More than or equal to 60 | 6 | $18 - 7 = 11$ |
More than or equal to 65 | 4 | $11 - 6 = 5$ |
More than or equal to 70 | 1 | $5 - 4 = 1$ |
Total | 30 | |
Interpretation: From this table, we can see, for example, that 18 students have weights of 55 kg or more, and 1 student has a weight of 70 kg or more.
Importance of Cumulative Frequency
Cumulative frequency distributions are fundamental tools in statistics for:
- Quickly answering questions about the number or proportion of observations above or below specific values.
- Constructing graphical representations like Ogives (Cumulative Frequency Curves), which are used to estimate the median, quartiles, and percentiles graphically.
- Calculating the median, quartiles, and percentiles using formulas for grouped data.
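The first use listed above can be sketched as a simple table lookup, using the 'less than' table for the weight data. The function name `count_below` is hypothetical, and the lookup only works for values that are class boundaries in the table.

```python
upper_boundaries = [45, 50, 55, 60, 65, 70, 75]
less_than_cf = [2, 7, 12, 19, 25, 29, 30]
n = less_than_cf[-1]   # total number of observations

def count_below(boundary):
    """Observations below a class boundary, read off the 'less than' table."""
    return less_than_cf[upper_boundaries.index(boundary)]

below_55 = count_below(55)      # students weighing less than 55 kg
at_least_55 = n - below_55      # students weighing 55 kg or more
share_below_55 = below_55 / n   # proportion below 55 kg

print(below_55, at_least_55, share_below_55)
```

The count of 55 kg or more obtained here by subtraction agrees with the 'more than' table, since the two cumulative frequencies at the same boundary always add up to $N$.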